Name | Version | Summary | Date |
docstrange | 1.0.9 | Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR. | 2025-08-01 15:43:50 |
kreuzberg | 3.10.1 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-07-31 11:54:20 |
html-to-markdown | 1.9.0 | A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options | 2025-07-29 15:40:00 |
document-data-extractor | 1.0.4 | Best open-source document to markdown extractor for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-29 08:25:56 |
llm-data-converter | 2.2.0 | Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-25 13:32:07 |
pyrtex | 0.1.6 | A Python library for batch text extraction and processing using Google Cloud Vertex AI | 2025-07-20 15:59:06 |
pdf-tools-mcp | 0.1.3 | A FastMCP-based PDF reading and manipulation tool server | 2025-07-18 03:32:46 |
mseep-kreuzberg | 3.8.2 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-07-17 03:32:28 |
pdfhandleretc | 0.1.1 | Lightweight command-line and Python API toolkit for PDF text extraction, encryption, permissions, and more. | 2025-07-16 04:04:16 |
pdf-ocr-processor | 2.0.3 | Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays | 2025-07-11 21:11:24 |
atai-pdf-tool | 0.1.0 | A tool for parsing and extracting text from PDF files with OCR capabilities | 2025-02-27 11:15:46 |
fileseek | 0.1.3 | FileSeek – AI-Powered Local Document Archive & Search | 2025-02-08 07:13:54 |
tikara | 0.1.5 | The metadata and text content extractor for almost every file type. | 2025-01-26 23:33:40 |
pdf-parser-header-footer | 0.1.0 | A Python package for processing PDFs with header and footer detection | 2025-01-14 16:10:34 |
spanish-pdf-parser | 0.1.0 | A Python package for processing PDFs with header and footer detection | 2025-01-13 14:56:27 |
vlense | 0.1.4 | A Python package to extract text from images and PDFs using Vision Language Model (VLM). | 2024-11-06 10:51:15 |
trafilatura | 1.12.2 | Python package and command-line tool designed to gather text on the Web, includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments. | 2024-09-10 12:42:48 |